Bayesian Logistic Regression: Application to Probability Prediction of Disease (Diabetes)

Capstone Project 2025

Author

Namita Mishra (Advisor: Dr. Ashraf Cohen)

Published

September 25, 2025

Slides: slides.html


Introduction

Diabetes Mellitus (DM) is a major public health challenge, and risk factors such as obesity, age, race, and gender contribute to developing diabetes. Identifying these risk factors (predictors) is essential for prevention and targeted intervention. Logistic regression is the standard tool for estimating the association between risk factors and binary outcomes, such as the presence or absence of diabetes. However, classical maximum likelihood estimation (MLE) can be unstable with small sample sizes, missing data, or quasi-complete or complete separation in the data.

Healthcare data include a wide variety of samples (DNA sequences, functional images of the brain, patient-reported outcomes, electronic health records (EHR), survey records, and sequences of health measurements, diagnoses, and treatments); these data are complex, and standard approaches to analyzing them are often inadequate. Zeger et al. (2020) combined prior knowledge with patient EHR data in a Bayesian hierarchical model, fit by MCMC, to predict a patient's health status, trajectory, and likely benefit from intervention. Using multivariate longitudinal patient data and R packages, they developed a two-level model, (1) time within person and (2) persons within a population, with covariates, combining the effects of exogenous (age, clinical history) and endogenous (current treatment) variables on an individual's multivariate health measurements. The model provided a posterior distribution and an estimate of the marginal distribution of the regression coefficients by Markov chain Monte Carlo (MCMC) integration. It was used to identify low-risk patient populations for pediatric pneumonia etiology, prostate cancer, and mental disorders. Because the model was entirely parametric (a model limitation), an extension to nonparametric or more flexible parametric models was recommended.

In a study of Bayesian inference (parametric vs. nonparametric), Chatzimichail and Hatjimihail (2023) calculated posterior probabilities of disease diagnosis via two modules, comparing parametric distributions (with a fixed set of parameters) and nonparametric distributions (which make no a priori assumptions). National Health and Nutrition Examination Survey data from two separate diagnostic tests, on both diseased and non-diseased populations, were used for model development. Conventional methods based on clinical criteria and fixed numerical thresholds limit the information captured about the intricate relationship between diagnostic tests and the varying prevalence of disease. The probability distributions of quantitative diagnostic test outcomes overlap between the diseased and non-diseased groups, and the dichotomous method fails to capture the complexity and heterogeneity of disease presentation across diverse populations. The applicability of the normal distribution (the conventional choice) is critiqued for data with skewness, bimodality, or multimodality. The authors presented Bayesian nonparametric (vs. parametric) diagnostic modeling as flexible distributional modeling of test outcomes yielding posterior disease probabilities. Their models, implemented in the Wolfram Language, integrated prior probabilities of disease with the distributions in the diseased and non-diseased populations, enabling evaluation of combined data from multiple diagnostic tests with improved diagnostic accuracy, precision, and adaptability. The models showed flexibility, adaptability, and versatility in diagnosis. Given the limited existing literature, they found nonparametric Bayesian models a better fit for the data distributions.
The model was reported to be robust in capturing complex data patterns, producing multimodal probability patterns for disease, unlike the bimodal, double-sigmoidal curves seen with parametric models. Study limitations included the reliance on parametric comparators, limited scholarly publications, and over-dependence on prior probabilities, which increases uncertainty and results in broader confidence intervals for posterior probabilities. They noted that systemic bias (unrepresentative datasets), incomplete datasets, and the absence of normative data compromise the accuracy, reliability, and validity of results from Bayesian diagnostic methods, and that combining these methods with other statistical and computational techniques could enhance diagnostic capability (Chatzimichail and Hatjimihail 2023).

To aid understanding of Bayesian methodology, an overview (stages, development, and advantages) by Schoot et al. (2021) specifies the importance of priors, data modeling, inference, model checking and refinement, selecting a proper sampling technique for the posterior distribution, variational inference, variable selection, and applications across various research fields. They applied Bayesian statistics across different fields (social sciences, ecology, genetics, medicine) to observed and unobserved parameters. They emphasize variable selection as the process of identifying the subset of predictors to include in a model, especially when a large number of potential predictors is available. Unnecessary variables cause issues such as multicollinearity, insufficient samples, and overfitting of the current data, leading to poor predictive performance on new data and making model interpretation difficult. Variable selection is best performed after checking correlations among the variables in the model (e.g., gene-to-gene interactions when predicting genes in biomedical research).

For small sample sizes, Bayesian estimation with mildly informative priors is often recommended. Priors (informative, weakly informative, and diffuse) are distinguished by the degree of (un)certainty, expressed through hyperparameters, surrounding the population parameter; a prior distribution with a larger variance represents greater uncertainty. Priors can be elicited in different ways (from experts, a generic expert, data-based sources, or sample data via maximum likelihood or sample statistics). A prior sensitivity analysis that examines different forms of the model can assess how the priors and the likelihood align and how they affect posterior estimates, reflecting variation not captured by the prior or the likelihood alone. Priors allow data-informed shrinkage, regularize or pull estimates toward a likely high-density region, and improve estimation efficiency. In a small sample, i.e., with less information, incorporating priors strengthens the observed data and lends plausible values for the unknown parameters. Probabilistic specification of the priors is especially important for complex models with small sample sizes. In Bayesian inference, unknown parameters (random variables) take varied values, while the (observed) data have fixed values. The likelihood is a function of θ for the fixed data y; it summarizes a statistical model that stochastically generates a range of possible values for θ given the observed data y. Combining the priors with the likelihood of the observed data, the resulting posterior distribution provides an estimate of the unknown parameters and captures the primary factors, improving our understanding. Monte Carlo techniques evaluate integrals by computer simulation of values sampled from a given distribution. The R packages brms and blavaan interface with the probabilistic programming language Stan.
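As a minimal illustration of how a prior strengthens a small sample, consider a conjugate Beta-Binomial update in base R; the numbers are toy values of our own, not from any study above. A Beta(2, 2) prior shrinks the raw proportion toward 0.5:

```r
# Toy sketch: conjugate Beta-Binomial update. With a Beta(a0, b0) prior and
# k events in n trials, the posterior is Beta(a0 + k, b0 + n - k).
a0 <- 2; b0 <- 2               # weakly informative prior centered at 0.5
k <- 9; n <- 10                # small sample: 9 events in 10 trials
a1 <- a0 + k
b1 <- b0 + n - k
c(mle = k / n,                 # 0.90, unstable in small samples
  post_mean = a1 / (a1 + b1))  # 11/14, pulled toward the prior
```

A wider prior (smaller a0, b0) encodes more uncertainty and shrinks less, matching the point above that prior variance expresses the degree of (un)certainty.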
An MCMC algorithm only requires the probability distribution of interest to be specified up to a constant of proportionality, and it scales to high dimensions, yielding empirical estimates of the posterior distribution of interest. Bayesian inference thus adopts a simulation-based strategy for approximating posterior distributions. Frequentists, by contrast, do not assign probability to the unknown parameters, which they consider fixed, and regard the likelihood as the conditional probability distribution p(y|θ) of the data y given the fixed parameters θ. Spatial and temporal variability are accommodated in Bayesian general linear models. A posterior distribution can simulate new data conditional on that distribution for assessment, providing valid predictions. In analyzing large-scale cancer genomic data, the Bayesian approach identifies novel molecular changes in cancer initiation and progression and interactions between mutated genes, captures mutational signatures and key genetic-interaction components, allows genomic-based patient stratification (clinical trials, personalized use of therapeutics), and aids understanding of cancer evolutionary processes. The authors discussed model reproducibility and reporting standards and outlined a checklist. Limitations related to dependencies (autocorrelation of parameters over time in temporal models) and the subjectivity of priors.
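The point that MCMC needs the target only up to a proportionality constant can be sketched in a few lines of base R. This is a random-walk Metropolis sampler on a toy Binomial posterior of our own construction, not the hierarchical models discussed above:

```r
# Random-walk Metropolis: the target is the unnormalized log-posterior for a
# success probability p, given 7 successes in 10 trials and a flat prior.
set.seed(42)
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)  # zero density outside (0, 1)
  7 * log(p) + 3 * log(1 - p)         # Binomial kernel; constant dropped
}
n_iter <- 20000
draws <- numeric(n_iter)
cur <- 0.5
for (i in seq_len(n_iter)) {
  prop <- cur + rnorm(1, sd = 0.2)                     # propose a move
  if (log(runif(1)) < log_post(prop) - log_post(cur))  # accept or reject
    cur <- prop
  draws[i] <- cur
}
mean(draws[-(1:2000)])  # after burn-in, close to the exact Beta(8, 4) mean, 8/12
```

Only the ratio of target densities enters the accept/reject step, so the normalizing constant never needs to be computed.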

The study by Klauenberg et al. (2015) emphasizes prior elicitation, analytical posteriors, and robustness checks, providing guidance on Bayesian inference through Bayesian Normal linear regression, a core parametric (conjugate) model with a Normal-Inverse-Gamma prior, applied in metrology to calibrate instruments and to evaluate inter-laboratory comparisons in determining fundamental constants.

In the Gaussian model, errors are independent and identically distributed, the variance is unknown and estimated from the data, and the relationship between X and Y is statistical, with noise and model uncertainty, so the regression cannot be treated as a measurement function. They noted that statistical approaches (likelihood, Bayesian, bootstrap, etc.) can quantify uncertainty, since the Guide to the Expression of Uncertainty in Measurement (GUM) and its supplements are not directly applicable. They advocated Bayesian inference that accounts for a priori information and robustifies the analysis through steps including prior elicitation, posterior calculation, checks of robustness to prior uncertainty, and assessment of model adequacy during model development, noting the assumptions critical to Bayesian inference.

In the Bayesian view, unknowns (observables: data; unobservables: parameters and auxiliary variables) are random and are assigned probability distributions reflecting the available information; prior knowledge about the unobservables is updated with the information about them contained in the data. Graphical representation of the prior distribution and likelihood function, sensitivity analyses, and model checking enhance the elicitation and interpretation process.

For Normal linear regression: (1) the Normal-Inverse-Gamma (NIG) distribution leads to a posterior in the same (NIG) family; with known observation variance σ², the NIG prior is a conjugate prior distribution, and vague or non-informative prior distributions can be derived from it. (2) Alternative families (hierarchical priors) assign an additional layer of distributions to uncertain prior parameters, or use nonparametric priors.
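With the observation variance σ² known, the conjugate Normal update for the regression coefficients has a closed form. A small base-R sketch on simulated data (our own illustration, not from Klauenberg et al.):

```r
# Conjugate update for Normal linear regression with known sigma^2:
# prior beta ~ N(m0, V0) gives posterior N(mn, Vn) with
#   Vn = (V0^{-1} + X'X / sigma2)^{-1},  mn = Vn (V0^{-1} m0 + X'y / sigma2).
set.seed(1)
n <- 50; sigma2 <- 1
X <- cbind(1, rnorm(n))          # design matrix: intercept + one predictor
beta_true <- c(0.5, 2)
y <- X %*% beta_true + rnorm(n, sd = sqrt(sigma2))

m0 <- c(0, 0)                    # prior mean
V0 <- diag(10, 2)                # fairly diffuse prior covariance
Vn <- solve(solve(V0) + crossprod(X) / sigma2)
mn <- Vn %*% (solve(V0) %*% m0 + crossprod(X, y) / sigma2)
drop(mn)                         # posterior mean, near (0.5, 2)
```

A diffuse V0 makes the posterior mean approach the least-squares fit; a tight V0 pulls it toward m0.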

Bayesian inference is influenced by:
  • the uncertainty in the transformation of prior knowledge into prior distributions
  • the assumptions of the statistical model
  • mistakes in data acquisition

A Bayesian hierarchical / meta-analytic linear regression model in a study by Leeuw and Klugkist (2012) was developed to test a formal method for augmenting data in linear regression analyses by incorporating both exchangeable and unexchangeable information on regression coefficients (and standard errors) from previous studies.

They highlighted the issue of multiple testing, which results in relatively low statistical power and is problematic in null-hypothesis significance testing. In multiple linear regression with separate significance tests for all regression coefficients and modest sample sizes, different studies report different sets of statistically significant predictors, and collecting the larger samples needed to address this is practically unrealistic. Standard linear regression analyses do not account for summary statistics from similar previous studies; ignoring this past information on the parameters affects the stability and precision of the parameter estimates, which report lower values, are less certain, and are more affected by sampling variation.

The study conducted Bayesian linear regression that accommodates prior knowledge, overcoming the absence of formal replication studies and the difficulty of increasing the sample size. They augmented the data of a new study by incorporating priors on regression coefficients and standard errors from previous similar studies.

To solve the issues of univariate analysis and of simultaneously combining multiple regression parameters per study, both of which ignore the relationships between the regression coefficients, Bayesian linear regression was combined with evidence on specific predictors from different linear regression analyses (meta-analysis). Adding summary statistics from previous studies provides a more acceptable solution when previous study data are not (realistically) obtainable.

Based on the information on predictors from previous and current data, the models are categorized as (1) Exchangeable, when the current data and previous studies share the same set of predictors, and (2) Unexchangeable, when the predictors differ between the studies.

They emphasize the following steps:

  1. Calculate the probability density function for the data given the unknown model parameters (the standard multiple linear regression model), integrate the prior, and obtain the joint posterior density using the Gibbs sampler.
  2. Specify the prior distribution, which quantifies what is assumed to be known about the model parameters before observing the data.
  3. Use a hierarchical model to analyze parameters when studies are not exchangeable.

They found that incorporating priors in a linear regression on new data yields significantly better parameter estimates with an adequate approximation, with encouraging performance gains for large effects.

Performance of the two versions (exchangeable and unexchangeable) of the replication model was consistently superior to using the current data alone.

The models using exchangeable and unexchangeable priors offer better parameter estimates in a linear regression setting without the need to expend large amounts of time and energy obtaining data from previous studies. The hierarchical unexchangeable model offers the additional advantage of addressing questions about differences between studies and thus allows explicit testing of the exchangeability assumption.

Limitations concerned the requirement of the same set of predictors and correlation between predictor variables (Leeuw and Klugkist 2012).

A Bayesian logistic regression (Bayesian GLM) model using a sequential clinical reasoning approach was developed in a longitudinal prospective cohort to predict the risk of incident cardiovascular disease (CVD). The authors developed three models: (1) demographic features (basic model); (2) six metabolic syndrome components (metabolic score model); (3) conventional risk factors (enhanced model). The method applied logistic regression with priors on the coefficients and sequential updating to predict individual-level CVD risk.

In the Liu et al. (2013) study, the authors developed the model to overcome the limited availability of molecular information in clinical practice (high cost and unavailability), which hampers efficient disease diagnosis. They proposed an alternative approach: efficiently identify a high-risk population from routinely checked biological markers before running these expensive molecular tests.

Models such as the Framingham Risk Score were not sufficient because the heterogeneity observed (geographic and ethnic-group variation, and social contextual networks) was often unobservable and unmeasurable and required constructing separate models.

They developed and applied the sequential clinical reasoning model to subjects aged 20–79 years enrolled in a screening program (the Keelung Community) in Keelung City, Taiwan, followed for 5 years to identify incident cancers and chronic diseases (cardiovascular disease). The study classified incident CVD cases by incorporating (1) a standardized risk score (MetS: fasting glucose, blood pressure, HDL-C, triglycerides, and waist circumference) and (2) risk factors: gender, heredity, smoking, alcohol drinking, family history, and betel quid.

The methodology is a Bayesian clinical reasoning approach applied in a sequential manner to develop three models that emulate a clinician's evaluation process. It treats the regression coefficients of all predictors as normally distributed, allowing for uncertainty in the clinical weights, and averages the credible intervals of the predicted risk estimates.

In the model, an individual's risk is elicited from a prior speculation (first impression) that is updated with objective observed data (the patient's history and laboratory findings); the regression coefficients for computing the risk score are treated as random variables with normal distributions rather than as fixed values (as in a traditional frequentist risk prediction model). Updating the prior distribution with the likelihood of the current data yields a posterior distribution for predicting the risk of a specific disease.
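The updating step in this clinical reasoning approach is, at its core, Bayes' rule. With illustrative numbers of our own (not from the study), a single positive finding updates a 10% prior risk as follows:

```r
# Bayes' rule for a binary finding: prior prevalence updated by a test
# with assumed sensitivity and specificity (illustrative values only).
prior <- 0.10
sens  <- 0.80   # P(finding positive | disease)
spec  <- 0.90   # P(finding negative | no disease)
post  <- sens * prior / (sens * prior + (1 - spec) * (1 - prior))
post            # 0.08 / 0.17, about 0.47
```

The posterior from one finding can serve as the prior for the next, which is what makes the reasoning sequential.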

Comparing the three models, the enhanced model, which incorporated conventional risk factors, performed best. The proposed models predicted CVD incidence at the individual level, incorporating routine information with a sequential Bayesian clinical reasoning approach. Patients' backgrounds contributed significantly to baseline risk. Even with ecological heterogeneity, the regression model adapted to individual characteristics and made individual risk predictions for CVD incidence. Noted limitations concerned interactions between predictors, and the models were cross-validated through external validation by applying them to new subjects not included in training the model parameters.

Bayesian multiple imputation and logistic regression were applied by Austin et al. (2021) to missing data in clinical research. They note that values may be missing (not measured or recorded for all subjects) for varied reasons: (i) patient refusal to respond to specific questions; (ii) loss of patients to follow-up; (iii) investigator or mechanical error; (iv) physicians not ordering certain investigations for some patients. The study emphasizes understanding the type of missingness (MAR, MNAR, MCAR). Multiple imputation (MI) addresses missing data by providing multiple plausible values for the missing values of a given variable, creating multiple completed data sets; identical statistical analyses are conducted on each completed data set, and the results are pooled across the completed data sets. The study walks through MI implementation and development, emphasizing the number of imputed data sets to create and how to handle derived variables. They illustrate MI by analyzing patients hospitalized with heart failure to estimate the probability of 1-year mortality in the presence of missing data, using R, SAS, and Stata (Austin et al. 2021).
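The final pooling step of MI uses Rubin's rules. A base-R sketch with toy numbers of our own (m = 5 imputations, made-up estimates) shows how the between-imputation spread inflates the pooled standard error:

```r
# Rubin's rules: pool one coefficient across m completed (imputed) data sets.
est <- c(0.52, 0.48, 0.55, 0.50, 0.47)  # estimate from each completed set
se  <- c(0.10, 0.11, 0.10, 0.12, 0.11)  # its standard error in each set
m <- length(est)
qbar <- mean(est)                       # pooled point estimate
W <- mean(se^2)                         # within-imputation variance
B <- var(est)                           # between-imputation variance
total_se <- sqrt(W + (1 + 1/m) * B)     # total variance adds imputation noise
c(qbar = qbar, total_se = total_se)
```

Because B > 0 whenever the imputations disagree, total_se always exceeds the naive within-imputation standard error, honestly reflecting missing-data uncertainty.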

Our study applies Bayesian logistic regression to a dataset with a quasi-separation issue, missing values, and the resulting small sample size. It is a retrospective study aimed at analyzing the relationship between diabetes status and key predictors (body mass index (BMI), age, gender, and race) using NHANES survey data collected in 2013–2014. The dataset initially had 9813 observations on 5 selected variables, but exploration revealed that the effective sample size was sharply reduced by complete-case analysis and listwise deletion of missing data. The small sample size and quasi-separation are challenges for traditional logistic regression: the complete-case model showed evidence of quasi-separation, with implausibly large coefficients and unstable estimates. This motivated us to apply multiple imputation (MICE) and conduct Bayesian logistic regression, which provides a flexible framework for modeling uncertainty and incorporating prior knowledge while avoiding the problems of separation and a small dataset.
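The instability that separation causes for MLE can be reproduced on a six-point toy dataset of our own: when a predictor perfectly splits the outcome, glm() drives the slope toward infinity, which is exactly the behavior that priors in a Bayesian fit prevent.

```r
# Toy example of complete separation: all y = 0 below x = 3.5 and all
# y = 1 above it, so the MLE of the slope diverges (glm warns about
# fitted probabilities of 0 or 1 and returns a huge coefficient).
x <- 1:6
y <- c(0, 0, 0, 1, 1, 1)
fit <- suppressWarnings(glm(y ~ x, family = binomial))
coef(fit)["x"]   # implausibly large slope with an enormous standard error
```

A prior on the coefficient acts as a penalty that keeps the posterior mode finite even under separation, which is the motivation for the Bayesian model in this study.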

Method and Data Preparation

The statistical tool used is R, with R packages and libraries to import, manage, and analyze the data. The data are NHANES 2-year cross-sectional data (2013–2014) from 3 datasets (demographics, exam, questionnaire) (Center for Health Statistics 1999). Using the haven package, the .XPT files were imported into RStudio and converted to data frames (df).

Data Management

Subsets created from the original 3 weighted datasets (demographics, exam, questionnaire) were merged into a single dataframe for analysis and exploration. The merged dataframe included the selected variables of interest and was cleaned; variables were categorized, recoded, and analyzed. Basic statistics, anomalies, and patterns are reported before running the Bayesian regression. The final dataset includes weighted means of all selected variables of interest. The table below describes all variables in the data.

  1. Response Variable (Diabetes, binary), defined from the question "Is there one Dr you see for diabetes".

  2. Predictor Variable (Body Mass Index, factor, 4 levels)

    The original data provide BMDBMIC (measured BMI) as a categorical variable. BMI_cat has the following 4 levels:
    o Underweight (<5th percentile)
    o Normal (5th–<85th percentile)
    o Overweight (85th–<95th percentile)
    o Obese (≥95th percentile)
    Missing values were kept as NA. We kept the variable as is, since the categorization provides clinically interpretable groups.

  3. Covariates:

  • Gender (factor, 2 levels): Male, Female
  • Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
  • Age (numeric, continuous)
Code
# formation of table with variable details

variables <- c("SEQN", "BMDBMIC", "RIDAGEYR", "RIAGENDR", "RIDRETH1", 
               "SDMVPSU", "SDMVSTRA", "WTMEC2YR", "DIQ240")

df <- data.frame(
  Variable = variables,
  Description = c("Respondent sequence number",
                  "BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.",
                  "Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.",
                  "Gender",
                  "Race/ethnicity Recode of reported race and Hispanic origin information",
                  "Sample PSU",
                  "Sample strata",
                  "MEC exam weight",
                  "Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors."))
df %>%
  gt() %>%
  tab_header(
    title = "Table Variable Description"
  ) %>%
  tab_footnote(
    footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team."
  )
Table Variable Description
Variable Description
SEQN Respondent sequence number
BMDBMIC BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.
RIDAGEYR Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.
RIAGENDR Gender
RIDRETH1 Race/ethnicity Recode of reported race and Hispanic origin information
SDMVPSU Sample PSU
SDMVSTRA Sample strata
WTMEC2YR MEC exam weight
DIQ240 Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors.
Each variable in the dataset, accompanied by a qualitative description from the study team.
  • Using library(survey), weighted means and standard errors of all variables in the merged dataframe (9813 observations) are presented.
Code
# weighted means of each variable                       
nhanes_design <- svydesign(
  id = ~SDMVPSU,
  strata = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  data = merged_data,
  nest = TRUE
)

svymean(~RIDAGEYR , design = nhanes_design, na.rm = TRUE)
           mean     SE
RIDAGEYR 37.504 0.4412
Code
svymean(~BMDBMIC, design = nhanes_design, na.rm = TRUE)
                         mean     SE
BMDBMICUnderweight   0.037703 0.0037
BMDBMICNormal weight 0.628530 0.0120
BMDBMICOverweight    0.162217 0.0056
BMDBMICObese         0.171551 0.0109
Code
svymean(~RIAGENDR, design = nhanes_design, na.rm = TRUE)
                  mean     SE
RIAGENDRMale   0.48861 0.0065
RIAGENDRFemale 0.51139 0.0065
Code
svymean(~RIDRETH1, design = nhanes_design, na.rm = TRUE)
                                                mean     SE
RIDRETH1Mexican American                    0.110450 0.0199
RIDRETH1Other Hispanic                      0.060637 0.0106
RIDRETH1Non-Hispanic White                  0.622315 0.0365
RIDRETH1Non-Hispanic Black                  0.120895 0.0173
RIDRETH1Other Race - Including Multi-Racial 0.085704 0.0076
Code
svymean(~DIQ240, design = nhanes_design, na.rm = TRUE)
             mean    SE
DIQ240Yes 0.78759 0.021
DIQ240No  0.21241 0.021

Raw Data Exploration and Visualization

  • A statistical summary and visualizations (histogram, bar plots) of the raw data are presented in Figure 4.
  • Most cases are Non-Hispanic White.
  • Among those who reported their diabetes status, more reported having diabetes.
  • Genders are relatively evenly distributed.
  • The majority of the population was in the normal BMI range.
  • The histogram (Figure 5) shows the frequency distribution of age, with slight right skewness.
  • The breakdown of missingness is presented in the graph.
Code
str(merged_data)
'data.frame':   9813 obs. of  9 variables:
 $ SEQN    : num  73557 73558 73559 73560 73561 ...
 $ BMDBMIC : Factor w/ 4 levels "Underweight",..: NA NA NA 2 NA NA NA NA NA NA ...
 $ RIDAGEYR: num  69 54 72 9 73 56 0 61 56 65 ...
 $ RIAGENDR: Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
 $ RIDRETH1: Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
 $ SDMVPSU : num  1 1 1 2 2 1 1 1 1 2 ...
 $ SDMVSTRA: num  112 108 109 109 116 111 105 114 112 112 ...
 $ WTMEC2YR: num  13481 24472 57193 55767 65542 ...
 $ DIQ240  : Factor w/ 2 levels "Yes","No": 1 1 2 NA NA NA NA NA NA NA ...
Code
plot_str(merged_data)
introduce(merged_data)
  rows columns discrete_columns continuous_columns all_missing_columns
1 9813       9                4                  5                   0
  total_missing_values complete_rows total_observations memory_usage
1                15381            14              88317       554936
Code
plot_intro(merged_data, title="Figure 1 (Merged dataset). Structure of variables and missing observations.")

Code
plot_missing(merged_data, title="Figure 2(Merged dataset). Breakdown of missing observations.")

Data Preprocessing and Cleaning

Since the special codes are not random and cannot simply be dropped, they were transformed into NAs (based on the variable codebook). All rows were retained for analysis, since R automatically excludes rows with NA during regression. Fitting a regression with lm() (listwise deletion, i.e., complete-case analysis) reduced the sample size drastically (n = 14), and we observed a quasi-separation warning on our dataset. Ignoring informative missingness (MAR or MNAR) in this way could introduce bias.
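The recoding step can be sketched in base R. The special codes below follow the usual NHANES codebook convention (e.g., 7 = Refused, 9 = Don't know) and are used here purely for illustration on a toy vector:

```r
# Recode codebook special values to NA before analysis (toy vector).
diq <- c(1, 2, 7, 1, 9, 2)    # 1 = Yes, 2 = No, 7/9 = special codes
diq[diq %in% c(7, 9)] <- NA   # special codes become missing
table(diq, useNA = "ifany")   # Yes/No counts plus an NA column
```

Recoding before converting to a factor ensures the special codes never appear as spurious factor levels in the regression.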

The results for the clean dataset with all NAs removed, and the breakdown of missingness, are shown in the graph below.

Code
plot_intro(clean_data, title="Figure 6. Breakdown of missing observations.")

The graph shows the data after removing all the NAs.

Code
# Count NAs in all columns

colSums(is.na(merged_data))
    SEQN  BMDBMIC RIDAGEYR RIAGENDR RIDRETH1  SDMVPSU SDMVSTRA WTMEC2YR 
       0     6290        0        0        0        0        0        0 
  DIQ240 
    9091 
Code
colSums(is.na(clean_data))
    SEQN  BMDBMIC RIDAGEYR RIAGENDR RIDRETH1  SDMVPSU SDMVSTRA WTMEC2YR 
       0        0        0        0        0        0        0        0 
  DIQ240 
       0 

Because of the small sample size after NAs are removed, we continue with the raw data. Descriptive statistics (counts, frequencies, proportions, means, and sd), visualizations of each variable, inter-variable proportions, and the relationships between variables and the outcome are presented here.

Code
# cross-tabulation #
tab1 <- table(merged_data$diab, merged_data$BMI_cat,useNA="ifany")
prop.table(tab1) * 100
      
       Underweight Normal weight  Overweight       Obese        <NA>
  Yes   0.00000000    0.05095282  0.03057169  0.02038113  5.53347600
  No    0.00000000    0.02038113  0.02038113  0.00000000  1.68144298
  <NA>  1.34515439   22.01161724  6.01243249  6.38948334 56.88372567
Code
tab2<-table(merged_data$race, merged_data$diab, useNA="ifany")# before cleaning and imputation
prop.table(tab2) * 100
                                     
                                             Yes         No       <NA>
  Mexican American                     0.8763885  0.4178131 15.8768980
  Other Hispanic                       0.4891470  0.1528585  8.8352186
  Non-Hispanic White                   2.0992561  0.5706716 33.3842862
  Non-Hispanic Black                   1.4878223  0.3668603 20.5441761
  Other Race - Including Multi-Racial  0.6827678  0.2140018 14.0018343
Code
tab3 <- table(merged_data$gender, merged_data$diab, useNA="ifany")
prop.table(tab3) * 100
        
                Yes         No       <NA>
  Male    2.6903088  0.8865790 45.6537247
  Female  2.9450729  0.8356262 46.9886885
Code
# Create a combined table of counts and percentages
vars <- c("BMI_cat", "race", "gender")
percent_table <- merged_data %>%
  select(all_of(vars), diab) %>%
  pivot_longer(cols = all_of(vars), names_to = "Variable", values_to = "Category") %>%
  group_by(Variable, Category, diab) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Variable, Category) %>%
  mutate(Percent = round(100 * Count / sum(Count), 1)) %>%
  ungroup()

# Fancy Tables #
kable(percent_table, caption = "Table 2. Percentage based on categories of Variables")
Table 2. Percentage based on categories of Variables
Variable Category diab Count Percent
BMI_cat Underweight NA 132 100.0
BMI_cat Normal weight Yes 5 0.2
BMI_cat Normal weight No 2 0.1
BMI_cat Normal weight NA 2160 99.7
BMI_cat Overweight Yes 3 0.5
BMI_cat Overweight No 2 0.3
BMI_cat Overweight NA 590 99.2
BMI_cat Obese Yes 2 0.3
BMI_cat Obese NA 627 99.7
BMI_cat NA Yes 543 8.6
BMI_cat NA No 165 2.6
BMI_cat NA NA 5582 88.7
gender Male Yes 264 5.5
gender Male No 87 1.8
gender Male NA 4480 92.7
gender Female Yes 289 5.8
gender Female No 82 1.6
gender Female NA 4611 92.6
race Mexican American Yes 86 5.1
race Mexican American No 41 2.4
race Mexican American NA 1558 92.5
race Other Hispanic Yes 48 5.2
race Other Hispanic No 15 1.6
race Other Hispanic NA 867 93.2
race Non-Hispanic White Yes 206 5.8
race Non-Hispanic White No 56 1.6
race Non-Hispanic White NA 3276 92.6
race Non-Hispanic Black Yes 146 6.6
race Non-Hispanic Black No 36 1.6
race Non-Hispanic Black NA 2016 91.7
race Other Race - Including Multi-Racial Yes 67 4.6
race Other Race - Including Multi-Racial No 21 1.4
race Other Race - Including Multi-Racial NA 1374 94.0
Code
kable(final_table, caption = "Table 1. Descriptive Statistics of Study Variables")
Table 1. Descriptive Statistics of Study Variables
Variable Category Count proportion Mean SD Min Max Proportion
Diabetes Status Yes 553 0.056 NA NA NA NA NA
Diabetes Status No 169 0.017 NA NA NA NA NA
Diabetes Status NA 9091 0.926 NA NA NA NA NA
Age NA NA NA 31.62957 24.39755 0 80 NA
Gender Male 4831 NA NA NA NA NA 0.492
Gender Female 4982 NA NA NA NA NA 0.508
BMI Category Underweight 132 NA NA NA NA NA 0.013
BMI Category Normal weight 2167 NA NA NA NA NA 0.221
BMI Category Overweight 595 NA NA NA NA NA 0.061
BMI Category Obese 629 NA NA NA NA NA 0.064
BMI Category NA 6290 NA NA NA NA NA 0.641
Race Mexican American 1685 NA NA NA NA NA 0.172
Race Other Hispanic 930 NA NA NA NA NA 0.095
Race Non-Hispanic White 3538 NA NA NA NA NA 0.361
Race Non-Hispanic Black 2198 NA NA NA NA NA 0.224
Race Other Race - Including Multi-Racial 1462 NA NA NA NA NA 0.149

Visualization of the data:
  • Bar plot of individual variables.
  • Box plot for cross-tabulation results (diabetes status vs. individual variables: age, gender, race, and BMI).
  • A higher proportion of those < 20 years of age reported being Mexican American.

Code
## Bar plot of age, gender, race, diabetes status, BMI ## 

plot_bar(merged_data, title = "Figure 3(Merged dataset). Frequency plots of categorical variables.")

Code
# Visualization Cross-Tabulation #

# DIQ240 by Age #
ggplot(merged_data, aes(x = age , y = factor(diab))) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Age by Diabetes Status", x = "Age", y = "Diabetes (2 = No, 1 = Yes)" )

Code
#  DIQ240 by BMI  #

ggplot(merged_data, aes(x = BMI_cat  , fill = factor(diab))) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(title = "Diabetes by BMI Category", x = "BMI Category", y = "Proportion", fill = "diab")   

Code
# DIQ240 by Race  #
ggplot(merged_data, aes(x = race, fill = factor(diab)  )) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(title = "Diabetes by Race ", x = "Race ", y = "Proportion", fill = "diab")

Code
# DIQ240 by Gender

ggplot(merged_data, aes(x = gender, fill = factor(diab)  )) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(title = "Diabetes by gender ", x = "gender ", y = "Proportion", fill = "diab")

Code
# Jittered scatter of diabetes status by age
# (column names used directly in aes(), not merged_data$...)
ggplot(merged_data) + 
    geom_point(aes(x = age, y = diab), 
               na.rm = TRUE,
               show.legend = TRUE,
               position = "jitter") 

Code
# Jittered scatter of BMI category by race
ggplot(merged_data) + 
    geom_point(aes(x = race, y = BMI_cat), 
               na.rm = TRUE,
               show.legend = TRUE,
               position = "jitter") 

Code
ggplot(merged_data %>% filter(age < 20), aes(x = race, y = BMI_cat, color = age)) +
  geom_jitter(alpha = 0.6, width = 0.2, height = 0.2, na.rm = TRUE) +
  labs(x = "Race", y = "BMI Category", color = "Age") +
  theme_minimal()

Code
# Plots of cross tabulation between variables # 

# gender vs race 
ggplot(merged_data) +
    geom_bar(aes(x = gender, fill = race))

Code
# gender vs BMI_cat
ggplot(merged_data) +
    geom_bar(aes(x = gender, fill = BMI_cat))

Code
#  race vs BMI_cat
ggplot(merged_data) +
    geom_bar(aes(x = race, fill = BMI_cat))

Code
# line graph for age 
ggplot(merged_data) + 
    geom_freqpoly(aes(x = age), binwidth = 5, na.rm = TRUE)  # age distribution#

Code
#  race vs age
ggplot(merged_data)+ 
    geom_bar(aes(x = age , fill = race))

Code
ggplot(merged_data %>% filter(age < 20)) +
  geom_bar(aes(x = age, fill = race), na.rm = TRUE) +
  labs(x = "Age (<20)", y = "Count", fill = "Race") +
  theme_minimal()

Frequentist method

  1. Multiple logistic regression
  • Multiple logistic regression on the raw dataset resulted in a small sample size (complete-case analysis and listwise deletion of NAs); presented are the first 6 deleted rows.
  • The data exhibited quasi-separation. Van Buuren and Van Buuren (2012).
  • Exploring the causes of missingness revealed both missing at random (MAR) and missing not at random (MNAR) patterns, which, as reported previously, are common in healthcare and public health datasets (plot on missingness below).
  2. Baseline regression model (only BMI to predict diabetes)
  • We fit a baseline regression model to assess whether the predictors significantly improve predictive power.
  • Null deviance = 16.75 (baseline fit).
  • Residual deviance = 15.11 (with BMI).
  • The small drop shows that BMI category adds very little predictive value over simply assuming the overall diabetes prevalence.
  3. Firth (penalized) regression

Firth (penalized) regression was considered to handle the quasi-separation, D’Angelo and Ran (2025). Firth regression is a frequentist approach that uses the Jeffreys prior for bias correction. Unlike Bayesian logistic regression, it provides no posterior distribution and involves no MCMC sampling.

Below are the summaries from the MLR (raw data), the baseline regression model, and the Firth regression.

Code
# MLR # and # quasi-complete separation #
m_raw<- glm(diab ~ BMI_cat + age + gender + race,
                       family = binomial, data = merged_data)  
summary(m_raw)

Call:
glm(formula = diab ~ BMI_cat + age + gender + race, family = binomial, 
    data = merged_data)

Coefficients:
                                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                                 94.694 312462.017       0        1
BMI_catOverweight                           50.968 296132.423       0        1
BMI_catObese                               -20.140 219692.181       0        1
age                                         -9.984  31870.105       0        1
genderFemale                                -4.011 642892.260       0        1
raceOther Hispanic                         -46.080 693578.425       0        1
raceNon-Hispanic White                     -50.787 363135.953       0        1
raceNon-Hispanic Black                      73.273 571735.835       0        1
raceOther Race - Including Multi-Racial     13.094 570685.423       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.6752e+01  on 13  degrees of freedom
Residual deviance: 2.2593e-10  on  5  degrees of freedom
  (9799 observations deleted due to missingness)
AIC: 18

Number of Fisher Scoring iterations: 25
Code
# Baseline Model - regression # 
m_raw1<- glm(diab ~ BMI_cat,
                       family = binomial, data = merged_data)  
summary(m_raw1)

Call:
glm(formula = diab ~ BMI_cat, family = binomial, data = merged_data)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)         -0.9163     0.8367  -1.095    0.273
BMI_catOverweight    0.5108     1.2383   0.413    0.680
BMI_catObese       -17.6498  4612.2021  -0.004    0.997

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16.752  on 13  degrees of freedom
Residual deviance: 15.106  on 11  degrees of freedom
  (9799 observations deleted due to missingness)
AIC: 21.106

Number of Fisher Scoring iterations: 17
Code
# Firth logistic regression
library(logistf) # penalized regression #

m_firth <- logistf(diab ~ BMI_cat + age + gender + race,
                   data = merged_data)
summary(m_firth)
logistf(formula = diab ~ BMI_cat + age + gender + race, data = merged_data)

Model fitted by Penalized ML
Coefficients:
                                              coef  se(coef)  lower 0.95
(Intercept)                              2.2909182 2.0520292  -1.7059362
BMI_catOverweight                        2.5151817 1.9838969  -1.2146946
BMI_catObese                            -0.4519637 1.8778881  -5.4843160
age                                     -0.2545704 0.1911568  -1.4579663
genderFemale                            -2.1675543 2.0190594 -17.0316807
raceOther Hispanic                       0.9518795 1.9225099 -12.0194116
raceNon-Hispanic White                  -2.5671499 2.2189914 -21.6563170
raceNon-Hispanic Black                   3.3280688 2.2005227  -0.4678607
raceOther Race - Including Multi-Racial  1.3072954 2.0609641  -2.6483209
                                        upper 0.95      Chisq          p method
(Intercept)                             16.1697706 1.25690744 0.26223728      2
BMI_catOverweight                       21.9385716 1.68994830 0.19360778      2
BMI_catObese                             3.4687761 0.05538832 0.81393924      2
age                                      0.1148471 1.81192970 0.17827692      2
genderFemale                             5.9967059 0.69556457 0.40427809      2
raceOther Hispanic                      10.8617660 0.20246895 0.65273532      2
raceNon-Hispanic White                   1.6700624 1.36287409 0.24304000      2
raceNon-Hispanic Black                  25.3575974 2.86971841 0.09026066      2
raceOther Race - Including Multi-Racial 21.7820826 0.39215076 0.53117103      2

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Wald test = 5.257182 on 8 df, p = 0.7297677

Analysis and Results

Unexpected reports, patterns or anomalies in the raw data

  • Issue of quasi-complete separation in the data (9,799 observations dropped).
  • Reduced sample size with few complete cases (n = 14).
  • The model is overfitted to this subset and cannot be generalized.
  • Huge coefficients (e.g., 94, –50, 73) and the tiny residual deviance suggest perfect separation and sparse data: some categories have very few observations, yielding an imbalanced outcome (very few 0s or 1s).
  • Logistic regression cannot estimate stable coefficients when predictors perfectly classify the outcome.
  • Firth regression dealt with the quasi-separation and kept the coefficients finite, but the sample size remained small (n = 14), so the estimates are highly uncertain, with wide confidence intervals; no strong claims about predictor effects can be made.
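Before fitting, (quasi-)separation can be screened for by cross-tabulating the outcome against each categorical predictor on the complete cases: any zero cell means a category perfectly predicts one outcome level. A minimal sketch, assuming `merged_data` with the columns used in this analysis:

```r
# Minimal sketch: screen complete cases for separation before fitting.
# Assumes merged_data has columns diab, BMI_cat, gender, race (as in this report).
library(dplyr)

cc <- merged_data %>%
  filter(!is.na(diab), !is.na(BMI_cat), !is.na(gender), !is.na(race))

# A zero cell in any outcome-by-predictor table signals (quasi-)separation
for (v in c("BMI_cat", "gender", "race")) {
  tab <- table(cc[[v]], cc$diab)
  if (any(tab == 0)) message("Possible separation in: ", v)
  print(tab)
}
```

With only 14 complete cases here, several such zero cells are expected, which is consistent with the divergent MLE fit above.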

Interpretation across models:
  • With only 14 non-missing cases, the models cannot be trusted because of the small sample size and the separation problem.
  • Whether all predictors were entered together or added sequentially, every model showed unstable, extreme estimates with meaningless standard errors.
  • Adding more predictors makes the deviance drop, but this reflects overfitting/separation, not true explanatory power.
  • BMI alone contributes very little.
  • Race and gender make the models appear stronger, but this rests on the small sample (n = 14) and a case of complete separation, not generalizable evidence.

We decided to perform imputation to retain the full N ≈ 9,813, addressing the small sample size and avoiding quasi-separation.

Multivariate Imputation by Chained Equations (MICE) Buuren and Groothuis-Oudshoorn (2011) - Bayesian Approach

  • Multiple imputation (MI) is considered as an alternative approach for the given raw dataset. Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples n > 400, even for high percentages (75%) of missing data in one variable, Van Buuren and Van Buuren (2012).
  • Multiple imputation (MI) is a Bayesian approach (we use the popular mice package in R) that adds sampling variability to the imputations.
  • Iterative imputation (MICE) imputes the missing values of one variable at a time, using regression models based on the other variables in the dataset.
  • This is a chained process: each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
  • Each variable is imputed using its own appropriate univariate regression model.
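The per-variable univariate models can be inspected and overridden through mice’s method vector. This report uses `method = 'pmm'` for all variables; the sketch below (assuming the `analytic_data` frame constructed later in this report) shows how variable-appropriate models could be specified instead:

```r
# Sketch: inspecting and customizing the univariate imputation models in mice.
# Assumes analytic_data as constructed in this report.
library(mice)

ini <- mice(analytic_data, maxit = 0)  # dry run: setup only, no imputation
ini$method                             # one method per variable ("" = no missings)

meth <- ini$method
meth["BMI_cat"] <- "polyreg"           # multinomial model for the 4-level factor
meth["diab"]    <- "logreg"            # logistic regression for the binary outcome

imp <- mice(analytic_data, m = 5, method = meth, seed = 123, printFlag = FALSE)
```

Predictive mean matching (pmm), as used in the report, is a reasonable default because it draws imputations from observed values and preserves the variables’ distributions.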

Results from MICE:

  • MI filled in the missing values, yielding 9,813 observations with no NAs. A comparative bar plot of missingness in the raw data and the imputed data is presented below.

  • A heatmap of the imputed dataset shows the correlation between BMI categories and diabetes status. The BMI dummy variables are strongly negatively correlated with one another.

  • There was no strong association between BMI category and diabetes in the dataset: a chi-square test of the categorical variables gave p-value = 0.5461 (> 0.05), providing no evidence of association.

Code
## Multiple Imputation performed 

library(mice)
library("VIM")

# Subset variables for imputation in analytic_data df

vars <- c("id","BMI_cat",  "age", "gender", "race", "SDMVPSU",  "SDMVSTRA", "WTMEC2YR", "diab"  )
analytic_data <- merged_data[, vars]

# Run mice to create 5 imputed datasets

imputed_data <- mice(
  analytic_data,
  m = 5,              # number of imputed datasets
  method = 'pmm',     # predictive mean matching
  seed = 123
)

 iter imp variable
  1   1  BMI_cat  diab
  1   2  BMI_cat  diab
  1   3  BMI_cat  diab
  1   4  BMI_cat  diab
  1   5  BMI_cat  diab
  2   1  BMI_cat  diab
  2   2  BMI_cat  diab
  2   3  BMI_cat  diab
  2   4  BMI_cat  diab
  2   5  BMI_cat  diab
  3   1  BMI_cat  diab
  3   2  BMI_cat  diab
  3   3  BMI_cat  diab
  3   4  BMI_cat  diab
  3   5  BMI_cat  diab
  4   1  BMI_cat  diab
  4   2  BMI_cat  diab
  4   3  BMI_cat  diab
  4   4  BMI_cat  diab
  4   5  BMI_cat  diab
  5   1  BMI_cat  diab
  5   2  BMI_cat  diab
  5   3  BMI_cat  diab
  5   4  BMI_cat  diab
  5   5  BMI_cat  diab
Code
str(imputed_data)
List of 23
 $ data           :'data.frame':    9813 obs. of  9 variables:
  ..$ id      : num [1:9813] 73557 73558 73559 73560 73561 ...
  ..$ BMI_cat : Factor w/ 4 levels "Underweight",..: NA NA NA 2 NA NA NA NA NA NA ...
  ..$ age     : num [1:9813] 69 54 72 9 73 56 0 61 56 65 ...
  ..$ gender  : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
  ..$ race    : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
  ..$ SDMVPSU : num [1:9813] 1 1 1 2 2 1 1 1 1 2 ...
  ..$ SDMVSTRA: num [1:9813] 112 108 109 109 116 111 105 114 112 112 ...
  ..$ WTMEC2YR: num [1:9813] 13481 24472 57193 55767 65542 ...
  ..$ diab    : Factor w/ 2 levels "Yes","No": 1 1 2 NA NA NA NA NA NA NA ...
 $ imp            :List of 9
  ..$ id      :'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ BMI_cat :'data.frame':    6290 obs. of  5 variables:
  .. ..$ 1: Factor w/ 4 levels "Underweight",..: 4 2 3 3 2 2 3 4 2 4 ...
  .. ..$ 2: Factor w/ 4 levels "Underweight",..: 2 4 4 4 4 2 2 3 4 2 ...
  .. ..$ 3: Factor w/ 4 levels "Underweight",..: 2 2 2 4 4 2 2 2 2 2 ...
  .. ..$ 4: Factor w/ 4 levels "Underweight",..: 4 2 2 2 4 2 3 2 2 1 ...
  .. ..$ 5: Factor w/ 4 levels "Underweight",..: 4 4 4 3 2 2 2 4 4 2 ...
  ..$ age     :'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ gender  :'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ race    :'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ SDMVPSU :'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ SDMVSTRA:'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ WTMEC2YR:'data.frame':    0 obs. of  5 variables:
  .. ..$ 1: logi(0) 
  .. ..$ 2: logi(0) 
  .. ..$ 3: logi(0) 
  .. ..$ 4: logi(0) 
  .. ..$ 5: logi(0) 
  ..$ diab    :'data.frame':    9091 obs. of  5 variables:
  .. ..$ 1: Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 2 1 1 1 ...
  .. ..$ 2: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 2 ...
  .. ..$ 3: Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 1 1 2 ...
  .. ..$ 4: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 2 1 2 1 1 ...
  .. ..$ 5: Factor w/ 2 levels "Yes","No": 2 1 1 1 2 2 1 1 2 2 ...
 $ m              : num 5
 $ where          : logi [1:9813, 1:9] FALSE FALSE FALSE FALSE FALSE FALSE ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9813] "1" "2" "3" "4" ...
  .. ..$ : chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ blocks         :List of 9
  ..$ id      : chr "id"
  ..$ BMI_cat : chr "BMI_cat"
  ..$ age     : chr "age"
  ..$ gender  : chr "gender"
  ..$ race    : chr "race"
  ..$ SDMVPSU : chr "SDMVPSU"
  ..$ SDMVSTRA: chr "SDMVSTRA"
  ..$ WTMEC2YR: chr "WTMEC2YR"
  ..$ diab    : chr "diab"
 $ call           : language mice(data = analytic_data, m = 5, method = "pmm", seed = 123)
 $ nmis           : Named int [1:9] 0 6290 0 0 0 0 0 0 9091
  ..- attr(*, "names")= chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ method         : Named chr [1:9] "" "pmm" "" "" ...
  ..- attr(*, "names")= chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ predictorMatrix: num [1:9, 1:9] 0 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:9] "id" "BMI_cat" "age" "gender" ...
  .. ..$ : chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ visitSequence  : chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ formulas       :List of 9
  ..$ id      :Class 'formula'  language id ~ BMI_cat + age + gender + race + SDMVPSU + SDMVSTRA + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ BMI_cat :Class 'formula'  language BMI_cat ~ id + age + gender + race + SDMVPSU + SDMVSTRA + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ age     :Class 'formula'  language age ~ id + BMI_cat + gender + race + SDMVPSU + SDMVSTRA + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ gender  :Class 'formula'  language gender ~ id + BMI_cat + age + race + SDMVPSU + SDMVSTRA + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ race    :Class 'formula'  language race ~ id + BMI_cat + age + gender + SDMVPSU + SDMVSTRA + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ SDMVPSU :Class 'formula'  language SDMVPSU ~ id + BMI_cat + age + gender + race + SDMVSTRA + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ SDMVSTRA:Class 'formula'  language SDMVSTRA ~ id + BMI_cat + age + gender + race + SDMVPSU + WTMEC2YR + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ WTMEC2YR:Class 'formula'  language WTMEC2YR ~ id + BMI_cat + age + gender + race + SDMVPSU + SDMVSTRA + diab
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
  ..$ diab    :Class 'formula'  language diab ~ id + BMI_cat + age + gender + race + SDMVPSU + SDMVSTRA + WTMEC2YR
  .. .. ..- attr(*, ".Environment")=<environment: 0x10af3210> 
 $ calltype       : Named chr [1:9] "pred" "pred" "pred" "pred" ...
  ..- attr(*, "names")= chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ post           : Named chr [1:9] "" "" "" "" ...
  ..- attr(*, "names")= chr [1:9] "id" "BMI_cat" "age" "gender" ...
 $ blots          :List of 9
  ..$ id      : list()
  ..$ BMI_cat : list()
  ..$ age     : list()
  ..$ gender  : list()
  ..$ race    : list()
  ..$ SDMVPSU : list()
  ..$ SDMVSTRA: list()
  ..$ WTMEC2YR: list()
  ..$ diab    : list()
 $ ignore         : logi [1:9813] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ seed           : num 123
 $ iteration      : num 5
 $ lastSeedValue  : int [1:626] 10403 139 -209464560 1217177579 1723988416 -1675914492 830028704 -1920259156 60153526 -743447978 ...
 $ chainMean      : num [1:9, 1:5, 1:5] NaN 3.01 NaN NaN NaN ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:9] "id" "BMI_cat" "age" "gender" ...
  .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. ..$ : chr [1:5] "Chain 1" "Chain 2" "Chain 3" "Chain 4" ...
 $ chainVar       : num [1:9, 1:5, 1:5] NA 0.699 NA NA NA ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:9] "id" "BMI_cat" "age" "gender" ...
  .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. ..$ : chr [1:5] "Chain 1" "Chain 2" "Chain 3" "Chain 4" ...
 $ loggedEvents   :'data.frame':    11 obs. of  5 variables:
  ..$ it  : int [1:11] 1 1 2 2 2 3 3 3 4 5 ...
  ..$ im  : int [1:11] 2 4 3 4 5 2 4 5 4 1 ...
  ..$ dep : chr [1:11] "diab" "diab" "diab" "diab" ...
  ..$ meth: chr [1:11] "pmm" "pmm" "pmm" "pmm" ...
  ..$ out : chr [1:11] "BMI_catNormal weight" "BMI_catNormal weight" "BMI_catObese" "BMI_catNormal weight" ...
 $ version        :Classes 'package_version', 'numeric_version'  hidden list of 1
  ..$ : int [1:3] 3 18 0
 $ date           : Date[1:1], format: "2025-09-25"
 - attr(*, "class")= chr "mids"
Code
print(imputed_data)
Class: mids
Number of multiple imputations:  5 
Imputation methods:
      id  BMI_cat      age   gender     race  SDMVPSU SDMVSTRA WTMEC2YR 
      ""    "pmm"       ""       ""       ""       ""       ""       "" 
    diab 
   "pmm" 
PredictorMatrix:
        id BMI_cat age gender race SDMVPSU SDMVSTRA WTMEC2YR diab
id       0       1   1      1    1       1        1        1    1
BMI_cat  1       0   1      1    1       1        1        1    1
age      1       1   0      1    1       1        1        1    1
gender   1       1   1      0    1       1        1        1    1
race     1       1   1      1    0       1        1        1    1
SDMVPSU  1       1   1      1    1       0        1        1    1
Number of logged events:  11 
  it im  dep meth                  out
1  1  2 diab  pmm BMI_catNormal weight
2  1  4 diab  pmm BMI_catNormal weight
3  2  3 diab  pmm         BMI_catObese
4  2  4 diab  pmm BMI_catNormal weight
5  2  5 diab  pmm BMI_catNormal weight
6  3  2 diab  pmm BMI_catNormal weight
Code
# First imputed dataset

Imputed_data1 <- complete(imputed_data, 1)

marginplot(
  Imputed_data1[, c("BMI_cat", "diab")],
  col = mdc(1:2, trans = FALSE),
  cex = 1.2,
  cex.lab = 1.2,
  cex.numbers = 1.3,
  pch = 19
)

Code
# Checking missingness in imputed data

summary(Imputed_data1)
       id                 BMI_cat          age           gender    
 Min.   :73557   Underweight  : 218   Min.   : 0.00   Male  :4831  
 1st Qu.:76092   Normal weight:5107   1st Qu.:10.00   Female:4982  
 Median :78643   Overweight   :1831   Median :27.00                
 Mean   :78645   Obese        :2657   Mean   :31.63                
 3rd Qu.:81191                        3rd Qu.:52.00                
 Max.   :83731                        Max.   :80.00                
                                  race         SDMVPSU         SDMVSTRA    
 Mexican American                   :1685   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 930   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3538   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2198   Mean   :1.483   Mean   :110.9  
 Other Race - Including Multi-Racial:1462   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
    WTMEC2YR       diab     
 Min.   :  3748   Yes:7252  
 1st Qu.: 13187   No :2561  
 Median : 20965             
 Mean   : 31713             
 3rd Qu.: 37922             
 Max.   :171395             
Code
summary (merged_data)
       id                 BMI_cat          age           gender    
 Min.   :73557   Underweight  : 132   Min.   : 0.00   Male  :4831  
 1st Qu.:76092   Normal weight:2167   1st Qu.:10.00   Female:4982  
 Median :78643   Overweight   : 595   Median :27.00                
 Mean   :78645   Obese        : 629   Mean   :31.63                
 3rd Qu.:81191   NA's         :6290   3rd Qu.:52.00                
 Max.   :83731                        Max.   :80.00                
                                  race         SDMVPSU         SDMVSTRA    
 Mexican American                   :1685   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 930   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3538   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2198   Mean   :1.483   Mean   :110.9  
 Other Race - Including Multi-Racial:1462   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
    WTMEC2YR        diab     
 Min.   :  3748   Yes : 553  
 1st Qu.: 13187   No  : 169  
 Median : 20965   NA's:9091  
 Mean   : 31713              
 3rd Qu.: 37922              
 Max.   :171395              
Code
colSums(is.na(Imputed_data1))  # check total missing values
      id  BMI_cat      age   gender     race  SDMVPSU SDMVSTRA WTMEC2YR 
       0        0        0        0        0        0        0        0 
    diab 
       0 
Code
plot_intro(Imputed_data1, title="Figure 8. Structure of variables and missing observations.")

Code
plot_bar(merged_data, title = "Figure 3. Frequency plots of categorical variables.")

Code
plot_bar(Imputed_data1, title = "Figure 9. Frequency plots of categorical variables.")

Code
plot_correlation(
  na.omit(data.frame(BMI_cat = Imputed_data1$BMI_cat,
                     diab    = Imputed_data1$diab)),
  maxcat = 5L,
  title = "Figure 10. Correlation matrix between BMI categories and Diabetes Status."
)

Code
# Cross-tabulation
tab <- table(Imputed_data1$BMI_cat, Imputed_data1$diab)
print(tab)
               
                 Yes   No
  Underweight    159   59
  Normal weight 3760 1347
  Overweight    1342  489
  Obese         1991  666
Code
# Chi-square test of independence
chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 2.1289, df = 3, p-value = 0.5461
Code
ggplot(Imputed_data1, aes(x = diab)) + 
  geom_bar()

Code
glimpse(Imputed_data1)
Rows: 9,813
Columns: 9
$ id       <dbl> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ BMI_cat  <fct> Obese, Normal weight, Overweight, Normal weight, Overweight, …
$ age      <dbl> 69, 54, 72, 9, 73, 56, 0, 61, 56, 65, 26, 9, 76, 10, 10, 33, …
$ gender   <fct> Male, Male, Male, Male, Female, Male, Male, Female, Female, M…
$ race     <fct> Non-Hispanic Black, Non-Hispanic White, Non-Hispanic White, N…
$ SDMVPSU  <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1…
$ SDMVSTRA <dbl> 112, 108, 109, 109, 116, 111, 105, 114, 112, 112, 113, 115, 1…
$ WTMEC2YR <dbl> 13481.042, 24471.770, 57193.285, 55766.512, 65541.871, 25344.…
$ diab     <fct> Yes, Yes, No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, …
Code
# diab within BMI categories
breakdown_BMI_cat <- Imputed_data1 %>%
  group_by(BMI_cat, diab) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(BMI_cat) %>%
  mutate(
    Percent = round(100 * Count / sum(Count), 1)
  )

breakdown_BMI_cat
# A tibble: 8 × 4
# Groups:   BMI_cat [4]
  BMI_cat       diab  Count Percent
  <fct>         <fct> <int>   <dbl>
1 Underweight   Yes     159    72.9
2 Underweight   No       59    27.1
3 Normal weight Yes    3760    73.6
4 Normal weight No     1347    26.4
5 Overweight    Yes    1342    73.3
6 Overweight    No      489    26.7
7 Obese         Yes    1991    74.9
8 Obese         No      666    25.1
Code
ggplot(breakdown_BMI_cat, aes(x = BMI_cat, y = Percent, fill = diab)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Relative Breakdown of Diabetes by BMI Category",
    x = "BMI Category", y = "Proportion"
  ) +
  theme_minimal()

Code
Imputed_data1 %>% 
  tabyl(diab, BMI_cat) %>% 
  adorn_percentages("col")
 diab Underweight Normal weight Overweight     Obese
  Yes   0.7293578     0.7362444  0.7329328 0.7493414
   No   0.2706422     0.2637556  0.2670672 0.2506586

Modeling

Code
table(Imputed_data1$BMI_cat, Imputed_data1$diab)
               
                 Yes   No
  Underweight    159   59
  Normal weight 3760 1347
  Overweight    1342  489
  Obese         1991  666
Code
table(merged_data$BMI_cat, merged_data$diab)
               
                Yes No
  Underweight     0  0
  Normal weight   5  2
  Overweight      3  2
  Obese           2  0
Code
ggplot(Imputed_data1, aes(x = BMI_cat, fill = diab)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "BMI Category vs Diabetes")

Code
ggplot(Imputed_data1, aes(x = age, fill = diab)) +
  geom_histogram(binwidth = 5, position = "fill") +
  labs(y = "Proportion", title = "Age distribution by Diabetes")

Code
ggplot(merged_data, aes(x = age, fill = diab)) +
  geom_histogram(binwidth = 5, position = "fill") +
  labs(y = "Proportion", title = "Age distribution by Diabetes")

Bayesian data augmentation and Logistic regression Austin et al. (2021)

For the given raw dataset, Bayesian logistic regression (GLM) is considered to calculate the probability of having diabetes. Incorporating a default prior and using a Bernoulli likelihood with a logit link yields the standard, interpretable model for risk.

  1. Priors - Bayesian augmentation stabilizes estimates via priors, providing finite, interpretable estimates and more reliable inferences.

    • Shrinkage is possible when including predictors (BMI, age, gender, race).

    • Encoding prior beliefs improves estimates when data are sparse (prior knowledge from the literature suggests obesity is linked to diabetes).

  2. Bayesian modeling works well with multiple imputation (or joint modeling), propagating uncertainty instead of silently dropping cases.

  3. Posteriors and credible intervals for both the coefficients and individual risk support clinical interpretation and shared decision-making.

  4. A Bayesian logistic GLM gives an interpretable, survey-aware, uncertainty-quantified diabetes risk model that is robust to sparse subgroups and easy to update.

  5. It is easy to update: a new NHANES cycle can update the calculated posterior rather than refitting from scratch, yielding a live risk model.

  6. Bayesian inference gives full probability statements instead of only odds ratios (ORs): credible intervals (e.g., 94%) have a direct probabilistic interpretation, rather than just p-values.

The Logistic regression model is:

\[ \text{logit}(P(Y_i=1)) = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot BMI_i + \beta_3 \cdot Gender_i + \beta_4 \cdot Race_i \]

Linear Regression equation:

\[ y_i = \beta_0 + \beta_1 X_1 +\varepsilon_i \]
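The Bayesian logistic model above can be fit with weakly informative priors; a minimal sketch using rstanarm (the package choice is an assumption, as this report does not fix one):

```r
# Sketch: Bayesian logistic regression with weakly informative normal priors,
# using rstanarm (package choice is an assumption, not fixed by this report).
library(rstanarm)

m_bayes <- stan_glm(
  diab ~ BMI_cat + age + gender + race,
  data            = Imputed_data1,
  family          = binomial(link = "logit"),
  prior           = normal(0, 2.5),   # weakly informative prior on coefficients
  prior_intercept = normal(0, 2.5),
  seed            = 123
)

summary(m_bayes)                          # posterior summaries and diagnostics
posterior_interval(m_bayes, prob = 0.94)  # 94% credible intervals
```

The `normal(0, 2.5)` priors shrink extreme log-odds toward zero, which is exactly the stabilizing behavior described in point 1 above.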

Comparison of multiple imputation and Bayesian data augmentation

Multiple imputation:
  • A frequentist approach; requires no priors; moderate flexibility.
  • Handles the missing values first by imputation, then performs the regression analysis and pools the results.
  • Uncertainty is propagated after the analysis (pooling).

Bayesian data augmentation:
  • Performs missing-data imputation and regression model fitting simultaneously.
  • Markov chain Monte Carlo (MCMC) draws samples from the joint posterior of the regression parameters and the missing values, providing complete datasets from which posterior means, credible intervals, and probabilities are extracted.
  • Performed directly on the data with missingness.
  • Shrinks extreme estimates back toward plausible values.
  • Uncertainty in the missing values is fully propagated through the model; naturally handles small or sparse datasets and separation problems.
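For the multiple-imputation route, the standard mice workflow is to fit the model in each of the m = 5 imputed datasets and pool with Rubin’s rules, rather than analyzing a single completed dataset. A sketch, assuming the `imputed_data` mids object created earlier:

```r
# Sketch: the standard MI workflow in mice -- fit the model in each
# imputed dataset, then pool the estimates with Rubin's rules.
library(mice)

fits   <- with(imputed_data,
               glm(diab ~ BMI_cat + age + gender + race, family = binomial))
pooled <- pool(fits)
summary(pooled)  # pooled estimates, std. errors, and fraction of missing info
```

Pooling is what propagates the between-imputation uncertainty that a single completed dataset (such as `Imputed_data1`) understates.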

Diagnostics performed before regression analysis

  1. Modeling assumption check
  2. Correlation matrix, Cook’s distance, influential points
  3. Model plot
Code
# MLR on imputed data
# assumption check (# for correlation # )
# correlation matrix

m_imp <- glm(diab ~ age + gender + race + BMI_cat,
          data = Imputed_data1, 
          family = binomial)                     ## frequentist model ##
m_imp

Call:  glm(formula = diab ~ age + gender + race + BMI_cat, family = binomial, 
    data = Imputed_data1)

Coefficients:
                            (Intercept)  
                              -0.761926  
                                    age  
                              -0.004728  
                           genderFemale  
                              -0.092730  
                     raceOther Hispanic  
                              -0.154250  
                 raceNon-Hispanic White  
                              -0.212663  
                 raceNon-Hispanic Black  
                              -0.055537  
raceOther Race - Including Multi-Racial  
                              -0.109388  
                   BMI_catNormal weight  
                               0.024339  
                      BMI_catOverweight  
                               0.071544  
                           BMI_catObese  
                               0.020287  

Degrees of Freedom: 9812 Total (i.e. Null);  9803 Residual
Null Deviance:      11270 
Residual Deviance: 11220    AIC: 11240
Code
m_raw

Call:  glm(formula = diab ~ BMI_cat + age + gender + race, family = binomial, 
    data = merged_data)

Coefficients:
                            (Intercept)  
                                 94.694  
                      BMI_catOverweight  
                                 50.968  
                           BMI_catObese  
                                -20.140  
                                    age  
                                 -9.984  
                           genderFemale  
                                 -4.011  
                     raceOther Hispanic  
                                -46.080  
                 raceNon-Hispanic White  
                                -50.787  
                 raceNon-Hispanic Black  
                                 73.273  
raceOther Race - Including Multi-Racial  
                                 13.094  

Degrees of Freedom: 13 Total (i.e. Null);  5 Residual
  (9799 observations deleted due to missingness)
Null Deviance:      16.75 
Residual Deviance: 2.259e-10    AIC: 18
Code
m_firth
logistf(formula = diab ~ BMI_cat + age + gender + race, data = merged_data)
Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood 

Coefficients:
                            (Intercept)                       BMI_catOverweight 
                              2.2909182                               2.5151817 
                           BMI_catObese                                     age 
                             -0.4519637                              -0.2545704 
                           genderFemale                      raceOther Hispanic 
                             -2.1675543                               0.9518795 
                 raceNon-Hispanic White                  raceNon-Hispanic Black 
                             -2.5671499                               3.3280688 
raceOther Race - Including Multi-Racial 
                              1.3072954 

Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Code
# Predict the log-odds of diabetes
Imputed_data1$logit <- predict(m_imp, type = "link")  # 'link' gives log-odds

# Convert log-odds to probabilities and plot against age
# (individual-level predicted probability of diabetes)

Imputed_data1$prob <- exp(Imputed_data1$logit) / (1 + exp(Imputed_data1$logit))  # inverse logit
ggplot(Imputed_data1, aes(x = age, y = prob)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(x = "Age", y = "Predicted Probability of Diabetes")

Code
# Calculate fitted values and residuals from the final model
    fitted_imputed1 <- fitted(m_imp)
    residual_imputed1 <- residuals(m_imp)
  
    # Example: fitted vs residuals plot
plot(fitted(m_imp), residuals(m_imp),
     xlab = "Fitted probabilities",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)  # reference line at 0

Code
# collinearity check #

library(car)
    vif(m_imp)  # VIF > 5 or 10 indicates multicollinearity 
            GVIF Df GVIF^(1/(2*Df))
age     1.108163  1        1.052693
gender  1.002032  1        1.001015
race    1.040913  4        1.005025
BMI_cat 1.078206  3        1.012629
Code
# Check for influential points and outliers # 
library(broom)
    influence_m_imp <- broom::augment(m_imp)
    influence_m_imp
# A tibble: 9,813 × 11
   diab    age gender race        BMI_cat .fitted .resid     .hat .sigma .cooksd
   <fct> <dbl> <fct>  <fct>       <fct>     <dbl>  <dbl>    <dbl>  <dbl>   <dbl>
 1 Yes      69 Male   Non-Hispan… Obese    -1.12  -0.750 0.001000   1.07 3.26e-5
 2 Yes      54 Male   Non-Hispan… Normal…  -1.21  -0.724 0.000549   1.07 1.65e-5
 3 No       72 Male   Non-Hispan… Overwe…  -1.24   1.73  0.000973   1.07 3.38e-4
 4 Yes       9 Male   Non-Hispan… Normal…  -0.993 -0.794 0.000593   1.07 2.20e-5
 5 Yes      73 Female Non-Hispan… Overwe…  -1.34  -0.682 0.000923   1.07 2.42e-5
 6 No       56 Male   Mexican Am… Normal…  -1.00   1.62  0.00105    1.07 2.88e-4
 7 Yes       0 Male   Non-Hispan… Normal…  -0.950 -0.809 0.000701   1.07 2.71e-5
 8 Yes      61 Female Non-Hispan… Overwe…  -1.28  -0.699 0.000835   1.07 2.32e-5
 9 Yes      56 Female Non-Hispan… Obese    -1.31  -0.691 0.000614   1.07 1.66e-5
10 No       65 Male   Non-Hispan… Normal…  -1.26   1.74  0.000646   1.07 2.27e-4
# ℹ 9,803 more rows
# ℹ 1 more variable: .std.resid <dbl>
Code
    Imputed_data1 <- Imputed_data1 %>%
    mutate(outlier =  if_else(abs(rstandard(m_imp))>2.5, "Suspected", "Not Suspected"))

    Imputed_data1 %>% count(outlier)
        outlier    n
1 Not Suspected 9813
Code
    cooks.distance(m_imp)  # Cook's distance for each observation

Code
# Visualization 
ggplot(influence_m_imp, aes(x = .hat, y = .cooksd)) +
  geom_point() +
  labs(x = "Leverage", y = "Cook's Distance")

Code
Imputed_data1 %>% ggplot(aes(x = BMI_cat, y = diab, color = outlier)) +
  geom_point() + 
  scale_color_manual(values = c("#999999", "#000000")) +
  labs(x = "BMI category", y = "Diabetes", color = "Outlier") +
  theme_bw()

Code
# recode the response variable to 0/1 (Yes = 1, No = 0)

Imputed_data1$diab_num <- ifelse(Imputed_data1$diab == "Yes", 1, 0)

library(ResourceSelection)
hoslem.test(Imputed_data1$diab_num, fitted(m_imp))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  Imputed_data1$diab_num, fitted(m_imp)
X-squared = 12190, df = 8, p-value < 2.2e-16
Code
summary(m_imp)

Call:
glm(formula = diab ~ age + gender + race + BMI_cat, family = binomial, 
    data = Imputed_data1)

Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -0.761926   0.162425  -4.691 2.72e-06
age                                     -0.004728   0.001011  -4.678 2.89e-06
genderFemale                            -0.092730   0.046133  -2.010  0.04442
raceOther Hispanic                      -0.154250   0.092549  -1.667  0.09558
raceNon-Hispanic White                  -0.212663   0.067746  -3.139  0.00169
raceNon-Hispanic Black                  -0.055537   0.072128  -0.770  0.44131
raceOther Race - Including Multi-Racial -0.109388   0.080345  -1.361  0.17336
BMI_catNormal weight                     0.024339   0.156361   0.156  0.87630
BMI_catOverweight                        0.071544   0.162658   0.440  0.66005
BMI_catObese                             0.020287   0.161137   0.126  0.89981
                                           
(Intercept)                             ***
age                                     ***
genderFemale                            *  
raceOther Hispanic                      .  
raceNon-Hispanic White                  ** 
raceNon-Hispanic Black                     
raceOther Race - Including Multi-Racial    
BMI_catNormal weight                       
BMI_catOverweight                          
BMI_catObese                               
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11267  on 9812  degrees of freedom
Residual deviance: 11219  on 9803  degrees of freedom
AIC: 11239

Number of Fisher Scoring iterations: 4
Code
anova (m_imp)
Analysis of Deviance Table

Model: binomial, link: logit

Response: diab

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                     9812      11267              
age      1  31.1696      9811      11236 2.364e-08 ***
gender   1   3.9271      9810      11232   0.04751 *  
race     4  12.2392      9806      11220   0.01566 *  
BMI_cat  3   0.7227      9803      11219   0.86784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
library(glmtoolbox)
(adjR2(m_imp))
[1] 0.00335
Code
# residual vs fitted plots: imputed-data model (m_imp) and raw-data model (m_raw)

plot3 <- plot(m_imp$fitted.values, resid(m_imp),   ## residual vs fitted plot - imputed data ##
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)

Code
plot1 <- plot(m_raw$fitted.values, resid(m_raw),   ## residual vs fitted plot -  raw data ##
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)

Code
# See numeric codes (1 = Yes, 2 = No)
table(as.numeric(Imputed_data1$diab), useNA = "ifany")

   1    2 
7252 2561 
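A note implied by the table above: with a factor response, `glm(..., family = binomial)` treats the first level as the non-event and models the probability of the second level. If the levels are ordered (Yes, No), as the numeric codes suggest, the fitted values from `m_imp` correspond to P(diab = "No"). A sketch of making "Yes" the modeled event (a suggestion, not part of the original analysis):

```r
# Make "No" the reference level so glm models P(diab == "Yes")
Imputed_data1$diab <- factor(Imputed_data1$diab, levels = c("No", "Yes"))

m_imp_yes <- glm(diab ~ age + gender + race + BMI_cat,
                 data   = Imputed_data1,
                 family = binomial)
# fitted(m_imp_yes) now returns P(diab == "Yes");
# coefficients flip sign relative to the P("No") parameterization
```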

Findings from regression of the MI data set

  1. MLR on imputed data (frequentist approach)
  • The relationship between age and the log-odds of diabetes is roughly, though not perfectly, linear, which is acceptable for the logistic regression assumptions.

  • The generalized variance inflation factors (adjusted GVIF ≈ 1.00–1.11) indicate no collinearity between predictors, so the model can be fitted without removing, dropping, or combining variables.

  • From Cook’s distance and the influential-point check: most data points are safe and do not influence the model. With ~9,813 cases, the cutoff is ≈ 0.0004. A cluster at high leverage reflects unusual predictor values but not high influence. A few points above the Cook’s distance cutoff are worth checking individually, but they pose no major threat to model stability. No outliers were detected (Not Suspected = 9,813).

  • The Hosmer–Lemeshow (H–L) test at alpha = 0.05 gives p < 0.001, so the logistic regression model does not fit the data well.

  • The graph shows residuals vs fitted values (imputed-data model).

  2. Results from Bayesian data augmentation and logistic regression
  • We incorporate prior knowledge that higher BMI increases the odds of diabetes.
  • We use priors for Bayesian logistic regression and compare models fitted with different priors:
    • Prior (intercept): taken from this study’s data.
    • Priors (coefficients): BMI, age, gender, race.
      • Weakly informative prior: β_BMI ∼ N(0, 2.5²).
      • A common approach is a normal distribution, β ∼ N(μ, σ²), for each regression coefficient.
      • Informative priors from previous studies: β_BMI ∼ N(μ, σ²), β_age ∼ N(μ, σ²), β_gender ∼ N(μ, σ²), β_race ∼ N(μ, σ²).

For males, we use the informative prior from Ali et al. (2024):

Normal(μ = 1.705, σ² = 0.448²).
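Priors like this can be supplied directly when fitting. A sketch assuming the rstanarm package (an assumption; the fitting package actually used in this report is not shown in this section):

```r
library(rstanarm)

# Coefficient order: age, genderFemale, 4 race levels, 3 BMI_cat levels (9 terms).
# Informative N(1.705, 0.448^2) on the gender term from Ali et al. (2024);
# weakly informative N(0, 2.5^2) priors elsewhere (illustrative choices).
m_bayes <- stan_glm(diab ~ age + gender + race + BMI_cat,
                    data   = Imputed_data1, family = binomial(),
                    prior  = normal(location = c(0, 1.705, rep(0, 7)),
                                    scale    = c(2.5, 0.448, rep(2.5, 7))),
                    prior_intercept = normal(0, 2.5))
```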

In m1, race is a significant predictor.

In m2, age is strongly significant, and race and gender are slightly significant predictors.

  • The cluster of points representing the higher probability of diabetes appears to be denser among individuals in the middle to older age ranges (e.g., roughly from 40 to 80 years old), compared to the younger age ranges, although diabetes is present even at younger ages.

Comparing Models

  • Logistic regression model on raw data

  • Bayesian logistic regression on imputed dataset (MI)

  • Bayesian data augmentation and logistic regression
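Out-of-sample predictive accuracy is a natural criterion for this comparison. A sketch assuming the loo package and hypothetical fitted-model objects (`m_bayes_mi`, `m_bayes_aug`):

```r
# library(loo)
# Bayesian fits (e.g., from rstanarm or brms) support approximate
# leave-one-out cross-validation:
# loo_compare(loo(m_bayes_mi), loo(m_bayes_aug))
#
# For the frequentist fit, AIC offers a rough analogue:
# AIC(m_imp)
```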

Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References

Ali, Sakhawat, Rizwana Hussain, Rohaib A Malik, Raheema Amin, and Muhammad N Tariq. 2024. “Association of Obesity With Type 2 Diabetes Mellitus: A Hospital-Based Unmatched Case-Control Study.” Cureus 16 (1): 1–7. https://doi.org/10.7759/cureus.52728.
Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 1999. “Vital and Health Statistics Reports, Series 2, Number 161, September 2013.” National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.” Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data With Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.